Predictive Modeling

Task 1

Multiple Regression

Part I: Research Question

  1. How do services and churn correlate?

  2. The goal of this analysis is to find the services that correlate with customer churn. For example, what services do the customers with the highest churn rate use?

Part II: Method Justification

  1. Assumptions of multiple regression:

There must be a linear relationship between the outcome variable and the independent variables.

Multiple regression assumes that the residuals are normally distributed.

Multiple regression assumes that the independent variables are not highly correlated with each other.

The variance of the error terms is similar across the values of the independent variables (homoscedasticity).

Multiple linear regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables.

  1. We are using Python because it is well suited to data analysis. Packages such as pandas and NumPy can be added to Python to extend its functionality significantly.
  1. Multiple regression will allow us to analyze each service variable and identify the services and variables that correlate with churn. An organization could then use this information to try to decrease overall customer churn.
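The multicollinearity assumption listed above can be checked numerically. The following is a minimal sketch, not the method used in this report: it computes a variance inflation factor (VIF) for each predictor with NumPy only. The helper name `vif` and the synthetic columns are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic illustration only: x3 is nearly a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

A VIF close to 1 suggests a predictor is not explained by the others, while a large VIF (a common rule of thumb is above 10) flags predictors that are highly correlated with each other.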

Part III: Data Preparation

  1. Data preparation will consist of cleaning the data, for example replacing nulls and removing duplicated rows. We will also convert categorical variables to numeric variables so we can run linear regression.

  2. The target variable is Churn. The predictor variables are all variables other than customer demographics such as city, state, age, gender, location, and job, which the company cannot control or change. The company can try to influence the remaining variables to increase tenure. We will be using service variables such as Contract, Payment Methods, and Internet Service.

  3. To prepare the data for analysis, we will remove unneeded columns, describe the data and its types, find missing values, and delete duplicates.

  1. Univariate and Bivariate visualizations
  1. Export Data
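The preparation steps above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code used for this report: the five rows are invented, and filling nulls with the median is one possible choice for replacing nulls.

```python
import pandas as pd

# Hypothetical miniature of the churn data; the real dataset has 10,000 rows.
raw = pd.DataFrame({
    "City": ["A", "B", "B", "C", "D"],
    "Contract": ["Month-to-month", "One year", "One year",
                 "Two Year", "Month-to-month"],
    "MonthlyCharge": [170.0, 150.0, 150.0, None, 200.0],
    "Tenure": [2.0, 40.0, 40.0, 65.0, 5.0],
    "Churn": ["Yes", "No", "No", "No", "Yes"],
})

df = raw.drop(columns=["City"])      # remove an unneeded demographic column
print(df.dtypes)                     # describe the column types
print(df.isna().sum())               # find missing values per column
df = df.drop_duplicates()            # the second and third rows are identical

# Replace nulls; the median is an assumption made for this sketch.
df["MonthlyCharge"] = df["MonthlyCharge"].fillna(df["MonthlyCharge"].median())

# Convert categorical variables to 0/1 indicator columns such as Churn_Yes.
df = pd.get_dummies(df, columns=["Contract", "Churn"], dtype=int)

# Export; passing a file path to to_csv would write a file instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The indicator columns produced this way follow the `Column_Value` naming pattern, which matches names such as Churn_Yes and Contract_Month-to-month used later in the report.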

Part IV: Model Comparison and Analysis

  1. Initial regression
  1. Reducing the dataset

We can use correlation and a heatmap to identify the most important variables for regression.

We are looking for the variables that correlate with Churn_Yes. These variables are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, and Bandwidth_GB_Year.
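Correlation-based selection of this kind can be sketched as below. The data is synthetic, the `Unrelated` column is invented, and the 0.2 cutoff is an assumption chosen for illustration; the report does not state the threshold that was used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
# Synthetic stand-in for the encoded data. Column names mirror the report,
# but the values and relationships are invented.
tenure = rng.uniform(1, 72, n)
monthly = rng.uniform(80, 290, n)
unrelated = rng.normal(size=n)
score = -0.05 * tenure + 0.01 * monthly + rng.normal(scale=0.5, size=n)
churn_yes = (score > np.median(score)).astype(int)

data = pd.DataFrame({
    "Tenure": tenure,
    "MonthlyCharge": monthly,
    "Unrelated": unrelated,
    "Churn_Yes": churn_yes,
})

corr = data.corr()   # pairwise Pearson correlations
# Rank predictors by the absolute value of their correlation with the target.
ranked = corr["Churn_Yes"].drop("Churn_Yes").abs().sort_values(ascending=False)
selected = ranked[ranked > 0.2].index.tolist()   # 0.2 is an assumed cutoff
print(ranked)
print(selected)
```

A heatmap of `corr`, for example drawn with seaborn's `heatmap` function, shows the same numbers graphically.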

  1. Reduced regression model
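Fitting a regression on a reduced set of predictors can be sketched with ordinary least squares in NumPy. This is an assumption-laden illustration rather than the model from this report: the data and the probabilities that generate it are invented, and the report's own fitting library is not stated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
# Synthetic predictors named after two columns in the report; values invented.
tenure = rng.uniform(1, 72, n)
month_to_month = rng.integers(0, 2, n)
# Invented churn probabilities, used only to generate a 0/1 outcome.
p = np.clip(0.3 - 0.004 * tenure + 0.3 * month_to_month, 0.01, 0.99)
churn_yes = (rng.uniform(size=n) < p).astype(float)

# Design matrix with an intercept column, then the least-squares fit.
X = np.column_stack([np.ones(n), tenure, month_to_month])
beta, *_ = np.linalg.lstsq(X, churn_yes, rcond=None)

fitted = X @ beta
ss_res = np.sum((churn_yes - fitted) ** 2)
ss_tot = np.sum((churn_yes - churn_yes.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(beta, r_squared)
```

Each entry of `beta` is the estimated change in the outcome for a one-unit change in that predictor with the others held fixed, and `r_squared` is the share of variance in the outcome that the model explains.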

Analysis

  1. Explain data analysis process

The variable selection technique was to use correlation and heatmaps to identify the variables that correlate most with churn.

The model evaluation metrics are shown in the regression results.

Plot Residuals:
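The quantities behind a residual plot can be computed as in the sketch below. The data is invented and the fit uses NumPy least squares, which is an assumption about tooling rather than the report's own code.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)   # invented linear data

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# With an intercept in the model, least-squares residuals average to zero
# and are uncorrelated with the fitted values.
mean_resid = residuals.mean()
corr_with_fitted = np.corrcoef(fitted, residuals)[0, 1]
print(mean_resid, corr_with_fitted)
```

Plotting `residuals` against `fitted`, for example with matplotlib's `scatter`, makes it easier to spot a funnel shape or a curve, either of which would suggest that the equal-variance or linearity assumption does not hold.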

Part V: Data Summary and Implications

Different coefficients have different effects on churn. Some variables have a higher correlation with churn and some have a lower correlation. The variables we found to correlate with Churn_Yes are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, and Bandwidth_GB_Year. Other variables have a lower correlation with churn or a higher correlation with no churn.

One limitation of this data analysis is that only one type of model was run on the dataset, which can cause inaccurate results. The dataset also has 10,000 rows, which is a lot, but not nearly enough to cover everyone who could be affected by an analysis like this. Customers and data that were not included in this dataset could change the results.

One course of action a company could take based on these results is to encourage customers to sign up for contract lengths other than month-to-month, as this contract length has a high correlation with churn.

Task 2

Logistic Regression

Part I: Research Question

  1. How do services and churn correlate?

  2. The goal of this analysis is to find the services that correlate with customer churn. For example, what services do the customers with the highest churn rate use?

Part II: Method Justification

  1. Assumptions of logistic regression:

Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

Logistic regression requires the observations to be independent of each other.

Logistic regression requires there to be little or no multicollinearity among the independent variables.

Logistic regression assumes a linear relationship between the independent variables and the log odds of the outcome.

Logistic regression typically requires a large sample size.

  1. We are using Python because it is well suited to data analysis. Packages such as pandas and NumPy can be added to Python to extend its functionality significantly.
  1. Logistic regression will allow us to analyze each service variable and identify the services and variables that correlate with churn. An organization could then use this information to try to decrease overall customer churn.
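The log-odds assumption listed above can be made concrete with a short sketch. The intercept and slope here are hypothetical values chosen for illustration, not coefficients from this report's model.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Map a probability to its log odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, not taken from the report's model.
intercept, slope = -1.0, 0.5

for x in (0, 1, 2):
    z = intercept + slope * x   # the log odds are linear in x
    p = sigmoid(z)              # the probability is not linear in x
    print(x, round(z, 2), round(p, 3))
```

Each one-unit increase in the predictor adds the same amount (the slope) to the log odds, while the change in probability depends on where on the curve the observation sits.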

Part III: Data Preparation

  1. Data preparation will consist of cleaning the data, for example replacing nulls and removing duplicated rows. We will also convert categorical variables to numeric variables so we can run logistic regression.

  2. The target variable is Churn. The predictor variables are all variables other than customer demographics such as city, state, age, gender, location, and job, which the company cannot control or change. The company can try to influence the remaining variables to increase tenure. We will be using service variables such as Contract, Payment Methods, and Internet Service.

  3. To prepare the data for analysis, we will remove unneeded columns, describe the data and its types, find missing values, and delete duplicates.

  1. Univariate and Bivariate visualizations
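The summaries that sit behind univariate and bivariate charts can be sketched with pandas. The six rows below are hypothetical; only the column names follow the report.

```python
import pandas as pd

# Hypothetical rows; column names follow the report.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two Year", "Month-to-month", "One year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# Univariate: how often each contract type occurs.
contract_counts = df["Contract"].value_counts()

# Bivariate: share of churners and non-churners within each contract type.
churn_by_contract = pd.crosstab(df["Contract"], df["Churn"], normalize="index")

print(contract_counts)
print(churn_by_contract)
```

Calling `.plot(kind="bar")` on either result draws the corresponding chart when matplotlib is installed.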

Part IV: Model Comparison and Analysis

  1. Initial regression

The initial data set gives a prediction score of 84.4%.

  1. Reducing the dataset

We can use correlation and a heatmap to identify the most important variables for regression.

  1. Reduced regression model

We are looking for the variables that correlate with Churn_Yes. These variables are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, and Bandwidth_GB_Year.

The reduced data set gives a prediction score of 89%, which is about 5 percentage points higher than the initial data set.
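A prediction score of this kind can be sketched as follows, assuming the score is classification accuracy (the share of rows whose predicted class matches the observed class). The data is synthetic and the model is fit by plain gradient descent in NumPy; this is a stand-in for illustration, not the library or data used in this report.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
# Synthetic, standardized predictors; invented for illustration.
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=n) < p).astype(float)

A = np.column_stack([np.ones(n), X])    # add an intercept column
w = np.zeros(A.shape[1])
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-(A @ w)))
    grad = A.T @ (pred - y) / n         # gradient of the mean log loss
    w -= 0.5 * grad                     # fixed step size, chosen by hand

# Classify as churn when the predicted probability is at least 0.5.
prob = 1.0 / (1.0 + np.exp(-(A @ w)))
pred_label = (prob >= 0.5).astype(float)
accuracy = (pred_label == y).mean()
print(w, accuracy)
```

Comparing `accuracy` between a model fit on all predictors and one fit on the reduced set gives the kind of comparison described above. Scoring on rows that were held out from fitting gives a more reliable estimate than scoring on the training rows.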

  1. Explain data analysis process

The variable selection technique was to use correlation and heatmaps to identify the variables that correlate most with churn.

The model evaluation metric is the prediction score. The reduced data set scores about 5 percentage points higher than the initial data set.

Part V: Data Summary and Implications

Different coefficients have different effects on churn. Some variables have a higher correlation with churn and some have a lower correlation. The variables we found to correlate with Churn_Yes are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, and Bandwidth_GB_Year. Other variables have a lower correlation with churn or a higher correlation with no churn.

One limitation of this data analysis is that only one type of model was run on the dataset, which can cause inaccurate results. The dataset also has 10,000 rows, which is a lot, but not nearly enough to cover everyone who could be affected by an analysis like this. Customers and data that were not included in this dataset could change the results.

One course of action a company could take based on these results is to encourage customers to sign up for contract lengths other than month-to-month, as this contract length has a high correlation with churn.